Parkinson’s Disease (PD) is a degenerative neurological disorder marked by decreased dopamine levels in the brain. It manifests itself through a deterioration of movement, including the presence of tremors and stiffness. There is commonly a marked effect on speech, including dysarthria (difficulty articulating sounds), hypophonia (lowered volume), and monotone (reduced pitch range). Additionally, cognitive impairments and changes in mood can occur, and risk of dementia is increased.
Traditional diagnosis of Parkinson’s Disease involves a clinician taking a neurological history of the patient and observing motor skills in various situations. Since there is no definitive laboratory test to diagnose PD, diagnosis is often difficult, particularly in the early stages when motor effects are not yet severe. Monitoring progression of the disease over time requires repeated clinic visits by the patient. An effective screening process, particularly one that does not require a clinic visit, would be beneficial. Since PD patients exhibit characteristic vocal features, voice recordings are a useful and non-invasive diagnostic tool. If machine learning algorithms could be applied to a voice-recording dataset to accurately diagnose PD, this would provide an effective screening step prior to an appointment with a clinician.
Domain : Medicine
name - ASCII subject name and recording number
MDVP:Fo(Hz) - Average vocal fundamental frequency
MDVP:Fhi(Hz) - Maximum vocal fundamental frequency
MDVP:Flo(Hz) - Minimum vocal fundamental frequency
MDVP:Jitter(%), MDVP:Jitter(Abs), MDVP:RAP, MDVP:PPQ, Jitter:DDP - Several measures of variation in fundamental frequency
MDVP:Shimmer, MDVP:Shimmer(dB), Shimmer:APQ3, Shimmer:APQ5, MDVP:APQ, Shimmer:DDA - Several measures of variation in amplitude
NHR, HNR - Two measures of ratio of noise to tonal components in the voice
Status - Health status of the subject (one) - Parkinson's, (zero) - healthy
RPDE, D2 - Two nonlinear dynamical complexity measures
DFA - Signal fractal scaling exponent
spread1, spread2, PPE - Three nonlinear measures of fundamental frequency
The goal is to classify the patients into their respective labels (Parkinson's vs. healthy) using the attributes from their voice recordings
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing
from sklearn import model_selection
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import roc_auc_score, roc_curve, auc
from scipy import stats
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn import svm
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import VotingClassifier
from mlxtend.classifier import StackingClassifier
%matplotlib inline
sns.set(color_codes = True)
df = pd.read_csv("Data-Parkinsons.csv")
df.shape
There are only 195 records (rows) and 24 columns in the data-set
df.head(10)
df.info()
df.status.value_counts()
There are a total of 24 attributes: 22 are continuous (numerical), the target attribute 'status' is Boolean, and the attribute 'name' is categorical.
df.describe().T
The independent attributes are measured on different scales: Hz, dB, %, absolute values (MDVP:Jitter(Abs)) and so on. Hence we will have to use a scaling technique to bring the different quantities of measurement onto a common scale.
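As a minimal sketch of the scaling step applied later in this notebook, standardization centers each column and divides by its standard deviation. The toy array below is hypothetical (two features on very different scales, e.g. a frequency in Hz next to a jitter percentage), not the real data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy stand-in: one column in the hundreds (Hz-like), one below 1 (%-like).
X_toy = np.array([[120.0, 0.4],
                  [150.0, 0.6],
                  [180.0, 0.8]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_toy)

# After scaling, each column has mean ~0 and unit variance, so no single
# unit of measurement dominates distance-based models such as KNN or SVM.
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```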
The count of each attribute is 195, implying there are no missing values in the data-set
df.isnull().sum()
There are indeed no missing values in any of the columns of the data-set
Summary :
1. There are only 195 records, which is a relatively small data-set
2. There are a total of 24 attributes: 22 are continuous (numerical), the target attribute 'status' is Boolean, and the attribute 'name' is categorical
3. There are no missing values in any of the attributes in the data-set
Challenges :
1. The independent attributes are measured on different scales, and hence the data-set needs to be standardized to a common scale
2. Out of 195 records, 147 are positive while only 48 are negative, so the data-set is heavily imbalanced towards positive cases
df.hist(bins=20, figsize=(28,28)) ;
Most of the continuous attributes have roughly bell-shaped, unimodal distributions, but most of them also show positive skewness rather than being perfectly normal
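Instead of printing the skewness column by column as below, one call computes it for every numeric attribute at once. A small toy frame stands in for the notebook's `df` here; the exponential column mimics a strongly right-skewed measure such as NHR:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the notebook's numeric columns.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'MDVP:Fo(Hz)': rng.normal(150, 30, size=100),   # roughly symmetric
    'NHR': rng.exponential(0.03, size=100),          # strongly right-skewed
})

# Skewness of every column in one call, largest first.
skewness = toy.skew().sort_values(ascending=False)
print(skewness)
```

On the real `df`, `df.drop(columns=['name', 'status']).skew()` would give the same one-shot overview.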
fig, ax = plt.subplots(2,3,figsize=(16,8))
# sns.distplot is deprecated in recent seaborn; with hist=False it is equivalent to kdeplot
sns.kdeplot(df['MDVP:Fo(Hz)'], ax=ax[0][0]);
sns.kdeplot(df['MDVP:Fhi(Hz)'], ax=ax[0][1]);
sns.kdeplot(df['MDVP:Flo(Hz)'], ax=ax[0][2]);
sns.kdeplot(df['MDVP:Jitter(%)'], ax=ax[1][0]);
sns.kdeplot(df['MDVP:Jitter(Abs)'], ax=ax[1][1]);
sns.kdeplot(df['MDVP:RAP'], ax=ax[1][2]);
print ('Skewness of [MDVP:Fo(Hz)] attribute :', round(df['MDVP:Fo(Hz)'].skew(),3))
print ('Skewness of [MDVP:Fhi(Hz)] attribute :', round(df['MDVP:Fhi(Hz)'].skew(),3))
print ('Skewness of [MDVP:Flo(Hz)] attribute :', round(df['MDVP:Flo(Hz)'].skew(),3))
print ('Skewness of [MDVP:Jitter(%)] attribute :', round(df['MDVP:Jitter(%)'].skew(),3))
print ('Skewness of [MDVP:Jitter(Abs)] attribute :', round(df['MDVP:Jitter(Abs)'].skew(),3))
print ('Skewness of [MDVP:RAP] attribute :', round(df['MDVP:RAP'].skew(),3))
fig, ax = plt.subplots(2,3,figsize=(16,8))
sns.kdeplot(df['MDVP:PPQ'], ax=ax[0][0]);
sns.kdeplot(df['Jitter:DDP'], ax=ax[0][1]);
sns.kdeplot(df['MDVP:Shimmer'], ax=ax[0][2]);
sns.kdeplot(df['MDVP:Shimmer(dB)'], ax=ax[1][0]);
sns.kdeplot(df['Shimmer:APQ3'], ax=ax[1][1]);
sns.kdeplot(df['Shimmer:APQ5'], ax=ax[1][2]);
print ('Skewness of [MDVP:PPQ] attribute :', round(df['MDVP:PPQ'].skew(),3))
print ('Skewness of [Jitter:DDP] attribute :', round(df['Jitter:DDP'].skew(),3))
print ('Skewness of [MDVP:Shimmer] attribute :', round(df['MDVP:Shimmer'].skew(),3))
print ('Skewness of [MDVP:Shimmer(dB)] attribute :', round(df['MDVP:Shimmer(dB)'].skew(),3))
print ('Skewness of [Shimmer:APQ3] attribute :', round(df['Shimmer:APQ3'].skew(),3))
print ('Skewness of [Shimmer:APQ5] attribute :', round(df['Shimmer:APQ5'].skew(),3))
fig, ax = plt.subplots(2,3,figsize=(16,8))
sns.kdeplot(df['MDVP:APQ'], ax=ax[0][0]);
sns.kdeplot(df['Shimmer:DDA'], ax=ax[0][1]);
sns.kdeplot(df['NHR'], ax=ax[0][2]);
sns.kdeplot(df['HNR'], ax=ax[1][0]);
sns.kdeplot(df['RPDE'], ax=ax[1][1]);
sns.kdeplot(df['DFA'], ax=ax[1][2]);
print ('Skewness of [MDVP:APQ] attribute :', round(df['MDVP:APQ'].skew(),3))
print ('Skewness of [Shimmer:DDA] attribute :', round(df['Shimmer:DDA'].skew(),3))
print ('Skewness of [NHR] attribute :', round(df['NHR'].skew(),3))
print ('Skewness of [HNR] attribute :', round(df['HNR'].skew(),3))
print ('Skewness of [RPDE] attribute :', round(df['RPDE'].skew(),3))
print ('Skewness of [DFA] attribute :', round(df['DFA'].skew(),3))
fig, ax = plt.subplots(1,4,figsize=(16,4))
sns.kdeplot(df['spread1'], ax=ax[0]);
sns.kdeplot(df['spread2'], ax=ax[1]);
sns.kdeplot(df['D2'], ax=ax[2]);
sns.kdeplot(df['PPE'], ax=ax[3]);
print ('Skewness of [spread1] attribute :', round(df['spread1'].skew(),3))
print ('Skewness of [spread2] attribute :', round(df['spread2'].skew(),3))
print ('Skewness of [D2] attribute :', round(df['D2'].skew(),3))
print ('Skewness of [PPE] attribute :', round(df['PPE'].skew(),3))
df.plot(kind='box', subplots=True, layout=(5,5), sharex=False, sharey=False, figsize=(20, 20))
plt.show()
Most of the independent continuous attributes have outliers. Since these are readings from the medical domain, we will not treat the outliers, as they may be genuine.
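For reference, the whiskers in the box plots above follow the 1.5*IQR rule; the sketch below flags (but keeps) readings outside the fences. The toy series is hypothetical, standing in for any continuous column of `df`:

```python
import pandas as pd

# Toy stand-in for one continuous attribute, with one extreme reading.
s = pd.Series([0.02, 0.03, 0.025, 0.028, 0.031, 0.30])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag values outside the fences for inspection; do not drop them.
outliers = s[(s < lower) | (s > upper)]
print(len(outliers))  # only the 0.30 reading falls outside the fences
```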
sns.pairplot(df, hue='status');
Let us now study the impact of each attribute on the target variable
fig, ax = plt.subplots(1,3,figsize=(16,6))
sns.boxplot(x = 'status', y = 'MDVP:Fo(Hz)' ,data = df,ax=ax[0]) ;
sns.boxplot(x = 'status', y = 'MDVP:Fhi(Hz)',data = df,ax=ax[1]) ;
sns.boxplot(x = 'status', y = 'MDVP:Flo(Hz)',data = df,ax=ax[2]) ;
The median of all three attributes is higher for healthy people than for those having the disease. The median average vocal fundamental frequency (MDVP:Fo) is around 200 Hz for healthy people, compared to about 145 Hz for those affected by the disease.
fig, ax = plt.subplots(2,3,figsize=(16,12))
sns.boxplot(x = 'status', y = 'MDVP:Jitter(%)' ,data = df,ax=ax[0][0]) ;
sns.boxplot(x = 'status', y = 'MDVP:Jitter(Abs)',data = df,ax=ax[0][1]) ;
sns.boxplot(x = 'status', y = 'MDVP:RAP' ,data = df,ax=ax[0][2]) ;
sns.boxplot(x = 'status', y = 'MDVP:PPQ' ,data = df,ax=ax[1][0]) ;
sns.boxplot(x = 'status', y = 'Jitter:DDP' ,data = df,ax=ax[1][1]) ;
sns.boxplot(x = 'status', y = 'DFA' ,data = df,ax=ax[1][2]) ;
The medians of the various measures of variation in fundamental frequency are lower for healthy people than for those suffering from Parkinson's disease.
fig, ax = plt.subplots(2,3,figsize=(16,12))
sns.boxplot(x = 'status', y = 'MDVP:Shimmer' ,data = df,ax=ax[0][0]) ;
sns.boxplot(x = 'status', y = 'MDVP:Shimmer(dB)',data = df,ax=ax[0][1]) ;
sns.boxplot(x = 'status', y = 'Shimmer:APQ3' ,data = df,ax=ax[0][2]) ;
sns.boxplot(x = 'status', y = 'Shimmer:APQ5' ,data = df,ax=ax[1][0]) ;
sns.boxplot(x = 'status', y = 'MDVP:APQ' ,data = df,ax=ax[1][1]) ;
sns.boxplot(x = 'status', y = 'Shimmer:DDA' ,data = df,ax=ax[1][2]) ;
The median of various measures of variation in amplitude is also lower for healthy people compared to the people suffering from the disease.
fig, ax = plt.subplots(1,2,figsize=(12,6))
sns.boxplot(x = 'status', y = 'NHR',data = df,ax=ax[0]) ;
sns.boxplot(x = 'status', y = 'HNR',data = df,ax=ax[1]) ;
fig, ax = plt.subplots(1,2,figsize=(12,6))
sns.boxplot(x = 'status', y = 'RPDE',data = df,ax=ax[0]) ;
sns.boxplot(x = 'status', y = 'D2' ,data = df,ax=ax[1]) ;
fig, ax = plt.subplots(1,3,figsize=(16,6))
sns.boxplot(x = 'status', y = 'spread1',data = df,ax=ax[0]) ;
sns.boxplot(x = 'status', y = 'spread2',data = df,ax=ax[1]) ;
sns.boxplot(x = 'status', y = 'PPE' ,data = df,ax=ax[2]) ;
There is overlap in values of spread1 and spread2 between healthy and sick people. However, the healthy people tend to have lower values for both the attributes compared to the ones having the disease.
corr = df.drop(columns=['name']).corr()  # drop the non-numeric 'name' column before correlating
sns.set_context("notebook", font_scale=1.0, rc={"lines.linewidth": 2.5}) ;
plt.figure(figsize=(25,25))
sns.heatmap(corr, annot=True) ;
The various MDVP:Jitter and MDVP:Shimmer attributes are highly correlated with each other. This is likely because they are different measures of the same underlying features (voice frequency and amplitude, respectively).
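Highly correlated pairs can also be listed programmatically from the correlation matrix rather than read off the heatmap. This sketch uses a hypothetical toy frame in place of the notebook's numeric columns; `Jitter_a`/`Jitter_b` are made-up near-duplicate measures:

```python
import numpy as np
import pandas as pd

# Toy stand-in: two near-duplicate measures plus one unrelated column.
rng = np.random.default_rng(1)
a = rng.normal(size=200)
toy = pd.DataFrame({
    'Jitter_a': a,
    'Jitter_b': a + rng.normal(scale=0.05, size=200),  # near-duplicate of Jitter_a
    'HNR': rng.normal(size=200),
})

corr = toy.corr().abs()
# Keep only the upper triangle so each pair is listed once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()             # MultiIndex Series of pairwise |correlations|
high_pairs = pairs[pairs > 0.9]
print(high_pairs)
```

Run on the real `corr`, this would surface exactly the Jitter/Shimmer clusters visible in the heatmap.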
Summary :
1. Most of the continuous attributes have roughly bell-shaped, unimodal distributions, and most show positive skewness
2. The attributes HNR, RPDE and DFA have negative skewness, while all other attributes have positive skewness
3. MDVP:Fhi(Hz), MDVP:Jitter(%), MDVP:Jitter(Abs), MDVP:RAP, MDVP:PPQ, Jitter:DDP, MDVP:Shimmer(dB), MDVP:APQ and NHR have visibly long tails, which is confirmed by their skew values
4. Most of the independent continuous attributes have outliers. Since these are readings from the medical domain, we will not treat the outliers, as they may be genuine
5. The median of the vocal fundamental frequency attributes is higher for healthy people than for those having the disease. The median average vocal fundamental frequency (MDVP:Fo) is around 200 Hz for healthy people, compared to about 145 Hz for those affected by the disease
6. The medians of the various measures of variation in fundamental frequency and amplitude are lower for healthy people than for those suffering from Parkinson's disease
7. The values of spread1 and spread2 overlap between healthy and sick people; however, healthy people tend to have lower values for both attributes
8. The various MDVP:Jitter and MDVP:Shimmer attributes are highly correlated with each other, likely because they are different measures of the same underlying features (voice frequency and amplitude, respectively)
9. The average vocal fundamental frequency is negatively correlated with the various measures of variation in fundamental frequency
10. The average vocal fundamental frequency is negatively correlated with the various measures of variation in amplitude
# We will drop the name column since it is used to uniquely identify patients and does not impact the target variable
X = df.drop(['name','status'], axis=1)
y = df['status']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
print("Number of people who DO NOT have PD : {0} ({1:0.2f}%)".format(len(df.loc[df['status'] == 0]), (len(df.loc[df['status'] == 0])/len(df.index)) * 100))
print("Number of people who have PD : {0} ({1:0.2f}%)".format(len(df.loc[df['status'] == 1]), (len(df.loc[df['status'] == 1])/len(df.index)) * 100))
print("")
print("People in training set who DO NOT have PD : {0} ({1:0.2f}%)".format(len(y_train[y_train[:] == 0]), (len(y_train[y_train[:] == 0])/len(y_train)) * 100))
print("People in training set who have PD : {0} ({1:0.2f}%)".format(len(y_train[y_train[:] == 1]), (len(y_train[y_train[:] == 1])/len(y_train)) * 100))
print("")
print("People in test set who DO NOT have PD : {0} ({1:0.2f}%)".format(len(y_test[y_test[:] == 0]), (len(y_test[y_test[:] == 0])/len(y_test)) * 100))
print("People in test set who have PD : {0} ({1:0.2f}%)".format(len(y_test[y_test[:] == 1]), (len(y_test[y_test[:] == 1])/len(y_test)) * 100))
print("\n\n")
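The prints above verify that the class ratio survived the split roughly intact. With a 147/48 imbalance, passing `stratify=y` to `train_test_split` guarantees the ratio exactly. A minimal sketch on hypothetical toy arrays (16 positives, 4 negatives, i.e. an 80%/20% split):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for X and y with a known 80%/20% class imbalance.
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.array([1] * 16 + [0] * 4)

# stratify=y_demo keeps the positive rate identical in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo)

print(y_tr.mean(), y_te.mean())  # both splits keep the 0.8 positive rate
```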
Due to varying magnitudes and units of measurement (Hertz, decibels, percentages), it is advisable to scale our training and test data-sets.
scaler = preprocessing.StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # transform only: the test set must be scaled with the training-set statistics, not refit
pd.DataFrame(X_train_scaled).head()
pd.DataFrame(X_test_scaled).head()
The data is scaled and there are no missing values. So we are ready to use the data for the model.
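Since the scaler must be fit on training data only, a `Pipeline` makes this automatic, which matters especially under cross-validation: each fold then fits the scaler on its own training portion. A sketch on toy data from `make_classification` (a stand-in for the voice features, not the real dataset):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy stand-in with the same shape as this dataset: 195 rows, 22 features.
X_demo, y_demo = make_classification(n_samples=195, n_features=22, random_state=42)

# The pipeline refits the scaler inside every CV fold, so no information
# from the held-out fold leaks into the scaling statistics.
pipe = make_pipeline(StandardScaler(), LogisticRegression(solver='liblinear'))
scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring='accuracy')
print(round(scores.mean(), 3))
```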
logreg = LogisticRegression(solver='liblinear')
# Fit the logistic regression model on scaled training data
logreg.fit(X_train_scaled, y_train)
print ("Score of [Training] data set : ",round(logreg.score(X_train_scaled, y_train),2))
# Predict on scaled test data set
logreg_y_predict = logreg.predict(X_test_scaled)
print ("Score of [Test] data set : ",round(logreg.score(X_test_scaled, y_test),2))
print("Logistic Regression Accuracy : ",round(metrics.accuracy_score(y_test, logreg_y_predict)*100,2),'%' )
#Print the classification report
print ("Classification Report :")
print ("-----------------------")
print (metrics.classification_report(y_test, logreg_y_predict))
print ("\n")
#Print the Confusion matrix
cm = metrics.confusion_matrix(y_test, logreg_y_predict)
print (cm)
print ('\n')
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=[0,1], yticklabels=[0,1]);
plt.xlabel('Predicted')
plt.ylabel('Actuals / Truth')
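The notebook imports `roc_auc_score` but never uses it; for an imbalanced data-set like this one, AUC is a useful complement to accuracy because it is computed from predicted probabilities rather than hard 0/1 labels. A sketch on toy `make_classification` data standing in for the scaled voice features:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Toy stand-in with this dataset's shape: 195 rows, 22 features.
X_demo, y_demo = make_classification(n_samples=195, n_features=22, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.30,
                                          random_state=42)

clf = LogisticRegression(solver='liblinear').fit(X_tr, y_tr)
probs = clf.predict_proba(X_te)[:, 1]   # probability of the positive class
auc_val = roc_auc_score(y_te, probs)
print(round(auc_val, 3))
```

On the real data, `roc_auc_score(y_test, logreg.predict_proba(X_test_scaled)[:, 1])` would give the corresponding figure.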
NB = GaussianNB()
# Fit the Naive Bayes model on scaled training data
NB.fit(X_train_scaled, y_train)
print ("Score of [Training] data set : ",round(NB.score(X_train_scaled, y_train),2))
# Predict on scaled test data set
NB_y_predict = NB.predict(X_test_scaled)
print ("Score of [Test] data set : ",round(NB.score(X_test_scaled, y_test),2))
print("Naive Bayes Accuracy : ",round(metrics.accuracy_score(y_test, NB_y_predict)*100,2),'%' )
#Print the classification report
print ("Classification Report :")
print ("-----------------------")
print (metrics.classification_report(y_test, NB_y_predict))
print ("\n")
#Print the Confusion matrix
cm = metrics.confusion_matrix(y_test, NB_y_predict)
print (cm)
print ('\n')
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=[0,1], yticklabels=[0,1]);
plt.xlabel('Predicted')
plt.ylabel('Actuals / Truth')
knn = KNeighborsClassifier(n_neighbors=3)
# Fit the KNN model on scaled training data
knn.fit(X_train_scaled, y_train)
print ("Score of [Training] data set : ",round(knn.score(X_train_scaled, y_train),2))
# Predict on scaled test data set
knn_y_predict = knn.predict(X_test_scaled)
print ("Score of [Test] data set : ",round(knn.score(X_test_scaled, y_test),2))
print("K-Nearest Neighbours Accuracy : ",round(metrics.accuracy_score(y_test, knn_y_predict)*100,2),'%' )
#Print the classification report
print ("Classification Report :")
print ("-----------------------")
print (metrics.classification_report(y_test, knn_y_predict))
print ("\n")
#Print the Confusion matrix
cm = metrics.confusion_matrix(y_test, knn_y_predict)
print (cm)
print ('\n')
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=[0,1], yticklabels=[0,1]);
plt.xlabel('Predicted')
plt.ylabel('Actuals / Truth')
SVM = svm.SVC(gamma=0.025, C=3)
# Fit the SVM model on scaled training data
SVM.fit(X_train_scaled, y_train)
print ("Score of [Training] data set : ",round(SVM.score(X_train_scaled, y_train),2))
# Predict on scaled test data set
svm_y_predict = SVM.predict(X_test_scaled)
print ("Score of [Test] data set : ",round(SVM.score(X_test_scaled, y_test),2))
print("Support Vector Machines Accuracy : ",round(metrics.accuracy_score(y_test, svm_y_predict)*100,2),'%' )
#Print the classification report
print ("Classification Report :")
print ("-----------------------")
print (metrics.classification_report(y_test, svm_y_predict))
print ("\n")
#Print the Confusion matrix
cm = metrics.confusion_matrix(y_test, svm_y_predict)
print (cm)
print ('\n')
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=[0,1], yticklabels=[0,1]);
plt.xlabel('Predicted')
plt.ylabel('Actuals / Truth')
dTree = DecisionTreeClassifier(criterion = 'gini', random_state=42)
# Fit the Decision Tree model on scaled training data
dTree.fit(X_train_scaled, y_train)
print ("Score of [Training] data set : ",round(dTree.score(X_train_scaled, y_train),2))
# Predict on scaled test data set
dTree_y_predict = dTree.predict(X_test_scaled)
print ("Score of [Test] data set : ",round(dTree.score(X_test_scaled, y_test),2))
print("Decision Trees Accuracy : ",round(metrics.accuracy_score(y_test, dTree_y_predict)*100,2),'%' )
#Print the classification report
print ("Classification Report :")
print ("-----------------------")
print (metrics.classification_report(y_test, dTree_y_predict))
print ("\n")
#Print the Confusion matrix
cm = metrics.confusion_matrix(y_test, dTree_y_predict)
print (cm)
print ('\n')
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=[0,1], yticklabels=[0,1]);
plt.xlabel('Predicted')
plt.ylabel('Actuals / Truth')
| Model Name | Model Accuracy on Test Data | False Positive | False Negative |
|---|---|---|---|
| Logistic Regression | 86.44 % | 6 | 2 |
| Naive Bayes | 77.97 % | 6 | 7 |
| K-Nearest Neighbours | 91.53 % | 5 | 0 |
| Support Vector Machines | 88.14 % | 7 | 0 |
| Decision Trees | 86.44 % | 5 | 3 |
The KNN model with 3 nearest neighbours gives the best accuracy on the test data. Also, KNN and SVM are the only two models with zero false negatives, which is important in this case.
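The False Positive / False Negative columns in such a summary can be read off a 2x2 confusion matrix directly rather than counted by eye. A sketch with hypothetical toy labels standing in for `y_test` and a model's predictions:

```python
from sklearn.metrics import confusion_matrix

# Toy stand-ins for the true test labels and a model's predictions.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 1, 0, 1, 0, 1]

# sklearn's 2x2 layout is [[TN, FP], [FN, TP]], so ravel() unpacks in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(fp, fn)  # prints: 1 1
```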
clf1 = KNeighborsClassifier(n_neighbors=3)
clf2 = DecisionTreeClassifier(criterion = 'gini', random_state=42)
clf3 = GaussianNB()
clf4 = svm.SVC(gamma=0.025, C=3)
lr = LogisticRegression(solver='liblinear')
sclf = StackingClassifier(classifiers=[clf1, clf2, clf3, clf4, lr], meta_classifier = lr)
for clf, label in zip([clf1, clf2, clf3, clf4, lr, sclf],
['KNN',
'Decision Tree',
'Naive Bayes',
'Support Vector Machine',
'Logistic Regression',
'StackingClassifier']):
scores = model_selection.cross_val_score(clf, X_train_scaled, y_train, cv=7, scoring='accuracy')
#print (scores)
print("Accuracy on [Training] data: %0.2f (+/- %0.2f) [%s]" % (scores.mean()*100, scores.std(), label))
# Predict on scaled test data set
sclf_fit = sclf.fit(X_train_scaled, y_train)
sclf_predict = sclf_fit.predict(X_test_scaled)
print("Stacking Meta-Classifier Accuracy : ",round(metrics.accuracy_score(y_test, sclf_predict)*100,2),'%' )
#Print the classification report
print ("Classification Report :")
print ("-----------------------")
print (metrics.classification_report(y_test, sclf_predict))
print ("\n")
#Print the Confusion matrix
cm = metrics.confusion_matrix(y_test, sclf_predict)
print (cm)
print ('\n')
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=[0,1], yticklabels=[0,1]);
plt.xlabel('Predicted')
plt.ylabel('Actuals / Truth')
The Stacking Classifier has an accuracy of 88.16 % on the training data-set (cross-validated) and 91.53 % on the test data-set.
Also, the Stacking Classifier model produces 0 false negatives on the test data.
rfclf = RandomForestClassifier(n_estimators = 70)
# Fit the Random Forest model on scaled training data
rfclf.fit(X_train_scaled, y_train)
print ("Score of [Training] data set : ",round(rfclf.score(X_train_scaled, y_train),2))
# Predict on scaled test data set
rfclf_y_predict = rfclf.predict(X_test_scaled)
print ("Score of [Test] data set : ",round(rfclf.score(X_test_scaled, y_test),2))
print("Random Forest Accuracy : ",round(metrics.accuracy_score(y_test, rfclf_y_predict)*100,2),'%' )
#Print the classification report
print ("Classification Report :")
print ("-----------------------")
print (metrics.classification_report(y_test, rfclf_y_predict))
print ("\n")
#Print the Confusion matrix
cm = metrics.confusion_matrix(y_test, rfclf_y_predict)
print (cm)
print ('\n')
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=[0,1], yticklabels=[0,1]);
plt.xlabel('Predicted')
plt.ylabel('Actuals / Truth')
bgclf = BaggingClassifier(n_estimators=100, max_samples= .7, bootstrap=True)
# Fit the Bagging Classifier model on scaled training data
bgclf.fit(X_train_scaled, y_train)
print ("Score of [Training] data set : ",round(bgclf.score(X_train_scaled, y_train),2))
# Predict on scaled test data set
bgclf_y_predict = bgclf.predict(X_test_scaled)
print ("Score of [Test] data set : ",round(bgclf.score(X_test_scaled, y_test),2))
print("Bagging Classifier Accuracy : ",round(metrics.accuracy_score(y_test, bgclf_y_predict)*100,2),'%' )
#Print the classification report
print ("Classification Report :")
print ("-----------------------")
print (metrics.classification_report(y_test, bgclf_y_predict))
print ("\n")
#Print the Confusion matrix
cm = metrics.confusion_matrix(y_test, bgclf_y_predict)
print (cm)
print ('\n')
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=[0,1], yticklabels=[0,1]);
plt.xlabel('Predicted')
plt.ylabel('Actuals / Truth')
abclf = AdaBoostClassifier( n_estimators = 50)
# Fit the AdaBoost classifier model on scaled training data
abclf.fit(X_train_scaled, y_train)
print ("Score of [Training] data set : ",round(abclf.score(X_train_scaled, y_train),2))
# Predict on scaled test data set
abclf_y_predict = abclf.predict(X_test_scaled)
print ("Score of [Test] data set : ",round(abclf.score(X_test_scaled, y_test),2))
print("AdaBoost Classifier Accuracy : ",round(metrics.accuracy_score(y_test, abclf_y_predict)*100,2),'%' )
#Print the classification report
print ("Classification Report :")
print ("-----------------------")
print (metrics.classification_report(y_test, abclf_y_predict))
print ("\n")
#Print the Confusion matrix
cm = metrics.confusion_matrix(y_test, abclf_y_predict)
print (cm)
print ('\n')
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=[0,1], yticklabels=[0,1]);
plt.xlabel('Predicted')
plt.ylabel('Actuals / Truth')
gbclf = GradientBoostingClassifier(n_estimators = 70, learning_rate = 1.0)
# Fit the GradientBoost classifier model on scaled training data
gbclf.fit(X_train_scaled, y_train)
print ("Score of [Training] data set : ",round(gbclf.score(X_train_scaled, y_train),2))
# Predict on scaled test data set
gbclf_y_predict = gbclf.predict(X_test_scaled)
print ("Score of [Test] data set : ",round(gbclf.score(X_test_scaled, y_test),2))
print("GradientBoost Classifier Accuracy : ",round(metrics.accuracy_score(y_test, gbclf_y_predict)*100,2),'%' )
#Print the classification report
print ("Classification Report :")
print ("-----------------------")
print (metrics.classification_report(y_test, gbclf_y_predict))
print ("\n")
#Print the Confusion matrix
cm = metrics.confusion_matrix(y_test, gbclf_y_predict)
print (cm)
print ('\n')
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=[0,1], yticklabels=[0,1]);
plt.xlabel('Predicted')
plt.ylabel('Actuals / Truth')
# Training classifiers
dt_clf = DecisionTreeClassifier(max_depth=4)
knn_clf = KNeighborsClassifier(n_neighbors=3)
svm_clf = svm.SVC(kernel='rbf', probability=True, C=3, gamma = 0.025)
lr_clf = LogisticRegression(solver='liblinear')
vot_clf = VotingClassifier(estimators=[('dt', dt_clf), ('knn', knn_clf), ('svm', svm_clf), ('logreg', lr_clf)],
voting='soft', weights=[1,1,1,1])
# Train the voting classifier on train data-set
vot_clf.fit(X_train_scaled, y_train)
# Predict on scaled test data set
vot_clf_predict = vot_clf.predict(X_test_scaled)
print("Voting Classifier Accuracy : ",round(metrics.accuracy_score(y_test, vot_clf_predict)*100,2),'%' )
#Print the classification report
print ("Classification Report :")
print ("-----------------------")
print (metrics.classification_report(y_test, vot_clf_predict))
print ("\n")
#Print the Confusion matrix
cm = metrics.confusion_matrix(y_test, vot_clf_predict)
print (cm)
print ('\n')
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=[0,1], yticklabels=[0,1]);
plt.xlabel('Predicted')
plt.ylabel('Actuals / Truth')
| Model Name | Model Accuracy on Test Data | False Positive | False Negative |
|---|---|---|---|
| Stacking Meta-classifier | 91.53 % | 5 | 0 |
| Random Forest | 89.83 % | 6 | 0 |
| Bagging Classifier | 91.53 % | 5 | 0 |
| AdaBoost Classifier | 89.83 % | 5 | 1 |
| GradientBoost Classifier | 89.83 % | 5 | 1 |
| Voting Classifier | 93.22 % | 4 | 0 |
A summary of all the supervised learning models (basic and ensemble techniques) follows :
| Model Name | Model Accuracy on Test Data | False Positive | False Negative |
|---|---|---|---|
| Logistic Regression | 86.44 % | 6 | 2 |
| Naive Bayes | 77.97 % | 6 | 7 |
| K-Nearest Neighbours | 91.53 % | 5 | 0 |
| Support Vector Machines | 88.14 % | 7 | 0 |
| Decision Trees | 86.44 % | 5 | 3 |
| Stacking Meta-classifier | 91.53 % | 5 | 0 |
| Random Forest | 89.83 % | 6 | 0 |
| Bagging Classifier | 91.53 % | 5 | 0 |
| AdaBoost Classifier | 89.83 % | 5 | 1 |
| GradientBoost Classifier | 89.83 % | 5 | 1 |
| Voting Classifier | 93.22 % | 4 | 0 |
False positives in the medical domain can be tolerated to an extent, since these misclassifications will be caught by further diagnosis. In other words, we do not lose much by labelling a healthy person as positive for Parkinson's, as follow-up diagnosis will clarify the symptoms.
False negatives, however, must be weighted heavily: if we classify a Parkinson's-positive patient as healthy, that person may go untreated, which could further worsen their well-being.
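Because false negatives are the costly error here, one further lever (beyond model choice) is the decision threshold: classifiers with `predict_proba` use 0.5 by default, and lowering it trades extra false positives for fewer false negatives, which suits a screening setting. A sketch on hypothetical toy data, not the notebook's dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Toy imbalanced stand-in (~75% positive, like this dataset).
X_demo, y_demo = make_classification(n_samples=400, n_features=10,
                                     weights=[0.25, 0.75], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3,
                                          random_state=0)

clf = LogisticRegression(solver='liblinear').fit(X_tr, y_tr)
probs = clf.predict_proba(X_te)[:, 1]

# Lowering the threshold can only grow the predicted-positive set,
# so FN can only fall (or stay) while FP can only rise (or stay).
results = {}
for threshold in (0.5, 0.3):
    preds = (probs >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_te, preds).ravel()
    results[threshold] = (fp, fn)
    print(f"threshold={threshold}: FP={fp}, FN={fn}")
```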
From the summary chart, Naive Bayes is the worst model for the given data-set.
The KNN model with 3 nearest neighbours, the Stacking model with Logistic regression as the meta-classifier, and the Bagging classifier all achieve an accuracy of 91.53% on the test data with 0 false negatives.
The Voting classifier also has 0 false negatives, with a slightly higher accuracy of 93.22%.
Considering accuracy, precision and recall, the Voting classifier is the best model to predict the onset of Parkinson's based on the voice recording data.